ByteLevelBPETokenizer output seems weird
The merges.txt and vocab.json files I obtained are not human-readable.
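For reference, a minimal sketch of the kind of training call that produces these two files (the corpus path and vocab size here are placeholder assumptions):

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a plain-text corpus
# (corpus path and vocab size are placeholder assumptions).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["corpus.txt"], vocab_size=30000)

# Writes vocab.json and merges.txt into the given directory.
tokenizer.save_model(".")
```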
Byte-level BPE converts every Unicode code point into one or more byte-level characters:
1. Each Unicode code point is decomposed into its UTF-8 bytes (1 byte for ASCII characters, up to 4 bytes for other code points).
2. Each byte value is then assigned a "visible" character, taken from the beginning of the Unicode table when the byte's own character is not printable.
As a result, some characters get a different representation: for example, the space U+0020 becomes Ġ (U+0120).
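The mapping is easy to reproduce. Below is a sketch following the bytes_to_unicode table from GPT-2's reference code, which byte-level BPE implementations such as the tokenizers library use: bytes whose own character is already printable keep it, and every other byte value is shifted up by 256 so it lands on a visible code point.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character,
    following GPT-2's bytes_to_unicode function."""
    # Byte values whose own character is already printable.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (controls, space, ...) take the next
            # unused code point starting at U+0100.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode()
print(table[0x20])  # 'Ġ': space (0x20) is shifted to 0x20 + 0x100 = U+0120
print("".join(table[b] for b in "コ".encode("utf-8")))  # 'ãĤ³'
```

The second print shows why the files look garbled: the three UTF-8 bytes of コ (E3 82 B3) map to ãĤ³, and it is that three-character string, not コ itself, that appears in vocab.json and merges.txt.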